Abstract. A secure human identification protocol aims at authenticating human users to a remote server even when the users' inputs are not hidden from an adversary. Recently, the authors proposed a human identification protocol at the RSA Conference 2007, which is loosely based on the ability of humans to efficiently process an image. The advantage is that an automated adversary cannot effectively attack the protocol without human assistance. This paper extends that work by addressing some of its open problems. First, we analyze the complexity of defeating the proposed protocols by quantifying the workload of a human adversary. Second, we propose a new construction based on textual CAPTCHAs (Reverse Turing Tests) in order to make the generation of automated challenges easier. We also present a brief experiment involving real human users to estimate the number of possible attributes in a given image, and give some guidelines for the selection of challenge questions based on the results. Finally, we analyze the previously proposed protocol in detail for the relationship between its secrets. Our results show that we can construct human identification protocols based on image evaluation with reasonably "quantified" security guarantees based on our model.



1 Introduction 

Suppose a student wishes to send a confidential email to a friend, disclosing the contents of a leaked exam paper. Using a secure email client, the student writes the email, sends it and logs out. The email is encrypted and can be viewed only by the recipient once he logs in to check his email. How can the student be sure that the email was sent securely and that no one apart from the intended recipient could learn anything from it? The answer relies on the weakest security link: the password. Little did the student know that his computer was being key-logged [18]. A hidden camera was watching his every move, and a fellow student had shoulder-surfed his password. Even if no one saw the email being written, any of them could log in later and view the sent mailbox.

It turns out that no matter how secure the email client was, it served its purpose only as long as the password was not compromised.

A similar situation occurs when one enters a PIN at an ATM. We could use biometrics instead of passwords or PINs, but biometric authentication is only secure as long as the biometric information is kept confidential and the equipment has not been tampered with. So, are these mechanisms insecure or useless? They certainly aren't. They were designed with certain assumptions in mind: passwords selected by the user should be truly random strings of suitable length; PINs selected by the user should be truly random 4-digit numbers; and the user is responsible for hiding her input from peeping eyes. Devising authentication protocols that meet the security requirements even when these assumptions do not hold has been an interesting research topic in cryptography. This highly vulnerable security environment has been termed the "naked human in a glass house" model in [16], although the first protocol constructed to be secure in this model was due to Matsumoto [2]. Following the terminology in the literature, we call such protocols "Human Identification Protocols", or HIPs for short. Many human identification protocols have been proposed with ease of "human execution" in mind: a protocol should take roughly the same amount of time as a password-based protocol. Researchers have tried to construct protocols that are secure and require little or no computation on the human's part. This is a very hard goal to achieve, as reflected by the small number of human identification protocols proposed over the years.

The situation, however, is not as bad as it seems. Humans possess good cognitive abilities. We can recognize a previously viewed image with very high probability when it is presented to us again, without having memorized all the details of the picture. This has led to the use of graphics as essential ingredients of human identification protocols. Proposed identification protocols can be given a graphical implementation. For example, instead of memorizing the string 00101, the user can be shown five pictures each time, where the third and the fifth pictures were shown to the user a priori as the user's secret pictures. We could go a step further and use the things that an image describes to build human identification protocols. In [1] we proposed that, instead of using pictures just as memory aids, we can use their internal structure to construct a human identification protocol: the secret could be one of the concepts that the picture satisfies. It was conjectured that it would be hard even for a human adversary to find the secret. However, no exact quantification of this hardness was given. While the hardness of breaking the protocol is clearly evident against automated adversaries (computer programs written to defeat the protocol), the security against "human adversaries" needs more attention. This study aims to extend the work proposed in [1] by answering some of its open problems, and takes a step further toward quantifying the hardness of the underlying problem. We address the following issues:



- We present a new protocol with the aim of generating challenge instances automatically. This construction is loosely based on the Gimpy CAPTCHAs [17] and requires the server to maintain only a dictionary of words.

- We analyze the security of the protocol presented in [1] and of the ones presented in this paper from a new perspective. We show how much work a "human" adversary has to perform in order to obtain the secret, based on our model.

- We show the interrelationship of the two "secrets" of the protocol presented in [1].

- We also present another way of viewing the underlying problem of these protocols using a matrix representation. This view helps to understand the principal source of hardness of the presented protocols.

- Finally, we report the results of some experiments that show the amount of information in a very simple image. This data leads us to some guidelines for selecting the secret, which we also present.

Our study deals with passive adversaries only. The reason is that handling active adversaries might require the human user to send random challenges to the remote server in order to authenticate it as well. Generating random challenges requires an extra amount of computation on the human user's part and might render the protocols impractical. This is left as future work.

2 Related Work 

The first work on human identification dates back to [2]. Since then, many other schemes have been proposed in the literature [4], [5], [6], [7], [9], [14], some of which were broken in [4], [15]. While most of them involve some numerical calculation, like [2] and the HB protocol [7], they can be implemented using a graphical interface that employs pictures as memory aids. We can divide human identification protocols into two broad categories: protocols built to be secure against general eavesdropping adversaries, and protocols secure only against "guessing adversaries", i.e., adversaries who do not see the user's input and hence try to guess the secret or impersonate the user without any a priori knowledge. The protocols mentioned so far fall into the first category. They have a drawback, however: they require extra computation from the user. For example, in the HB protocol [7], the user is required to compute a bitwise binary multiplication (a binary inner product) over a number of bits in every iteration. This may not seem like much, but the amount of computation increases significantly for higher levels of security.
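To make the per-round cost concrete, here is a minimal Python sketch of the computation a human performs in one round of an HB-style protocol [7]: a binary inner product of the challenge with the secret, XORed with a noise bit that is 1 with probability eta. The function names and the parameter name eta are our own illustrative choices, not taken from [7].

```python
import random


def hb_response(challenge, secret, eta, rng):
    """One HB-style round: z = <challenge, secret> mod 2, flipped with probability eta."""
    inner = 0
    for c_bit, s_bit in zip(challenge, secret):
        inner ^= c_bit & s_bit  # bitwise AND, accumulated with XOR (parity)
    noise = 1 if rng.random() < eta else 0
    return inner ^ noise


def bit_operations_per_round(k):
    """k ANDs plus k - 1 XORs for a k-bit secret (the noise flip is not counted)."""
    return 2 * k - 1


rng = random.Random(0)
k = 16
secret = [rng.randrange(2) for _ in range(k)]
challenge = [rng.randrange(2) for _ in range(k)]
print(hb_response(challenge, secret, eta=0.125, rng=rng), bit_operations_per_round(k))
```

The work per round grows linearly with the secret length k. Because a random guess passes a single round with probability 1/2, the round must also be repeated many times, so the human's total effort is roughly the per-round count multiplied by the number of rounds.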

In the second category, the best-known example is the traditional password-based authentication system. Others include purely graphical schemes like Déjà Vu [10], Passface [11], Point & Click [12] and [3], which require little or no numerical computation whatsoever. The basic theme of [10] and [11] is to present the user with a series of pictures, a subset of which are the secret pictures. The user is authenticated if the user's selection of the secret pictures among the given set is correct. In [12], on the other hand, the user is authenticated by clicking on the correct secret location in the given picture. [3] works similarly by letting the user draw a secret symbol or figure on a display device. Evidently, these purely graphical schemes are not secure against "peeping" attacks [9]: anyone observing the user's actions can find out the secret in no time. For a detailed account of all these schemes, see [9]. In our previous work [1], we proposed using internal properties of images as secrets. After a secret property has been chosen, pictures that satisfy the property are presented in random order together with pictures that do not, and the user answers each picture according to the secret property. It was conjectured that finding the secret property is a hard problem for adversaries. For automated adversaries, this follows immediately from the definition of CAPTCHAs [8]. For human adversaries, however, this hardness is difficult to prove. In this paper, we try to quantify this hardness and to answer some of the open questions described above.

3 The Main Idea 

Consider the picture of a magic square shown in Figure 1. How many things does the picture represent? A simple glance can reveal many: a magic square, a square, nine small squares, digits, black, white, the digit 4, the digit 2, line(s), the right angle, etc. It is remarkable how much information such a simple-looking picture contains. We call each piece of this information a feature. We can take one of these features and construct a question out of it, for example: "Does the picture contain a rectangle?" This question is shared as a secret between the server and the user. After that, a series of pictures is presented to the user such that each picture satisfies the question with probability 1/2 and fails to satisfy it with probability 1/2. The user has to scan each picture and answer 'yes' or 'no' accordingly. What about the adversary? The adversary would like to learn the secret feature. The best way to do so is to extract all features in the pictures presented to the user and then take the intersection (if the user's answer is 'yes') or the difference (if the user's answer is 'no') to narrow down the set of possible secret features.

From an abstract point of view, we may define a universal set of all features that pictures could possibly have. Denote this set as S = {1, 2, ..., n}, where n is the total number of features. Any subset of this set represents a picture. Given a subset A of this set (i.e., a picture) and the corresponding response bit a from the user, we are interested in how the adversary can find the hidden feature and hence the secret question. It is practically impossible to write a computer program which, given a picture, extracts all the features that the picture describes, considering the enormous size of the set of all features. This immediately implies the need for human intervention. So, what is the best strategy for finding the hidden feature? One way is to glance cursorily at the pictures and see if there is something common between them. A more efficient way is to find all the features in the pictures and then check which features are common to them. But once the features have been extracted, one could write a program that automatically filters out the common features. Thus we will only be concerned with the workload on the human adversary, which amounts to finding all the features in the pictures until a single common feature is found. Our analysis quantifies the complexity of the proposed schemes in terms of the workload on a human adversary using this approach. Namely, given a picture A and its corresponding answer bit a, the adversary keeps as candidate secret features only those features that are present in A if a = 1, or deletes all the features present in A from the candidate set if a = 0. Thus the adversary's job is to find the features by personally checking every picture, and to narrow down the set of candidate secret features accordingly.

Fig. 1. A simple picture of a magic square may describe a lot of things
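The adversary's filtering step can be sketched in a few lines of Python. Pictures are modeled as sets of feature indices, and, purely as an assumption for the simulation, each feature is present in a picture independently with probability 1/2. Under that model the candidate set roughly halves with every observed picture, so on the order of log_2 n pictures suffice. The helper names are hypothetical.

```python
import random


def narrow_candidates(candidates, features, answer):
    """One filtering step: intersect on a 'yes' (a = 1), subtract on a 'no' (a = 0)."""
    if answer == 1:
        return candidates & features   # the secret feature must be in this picture
    return candidates - features       # the secret feature cannot be in this picture


def pictures_until_unique(n, secret, rng, max_pictures=200):
    """Observe random pictures until a single candidate feature remains."""
    candidates = set(range(n))
    for count in range(1, max_pictures + 1):
        # Assumed picture model: each feature present independently with prob. 1/2.
        features = {f for f in range(n) if rng.random() < 0.5}
        answer = 1 if secret in features else 0   # the legitimate user's reply
        candidates = narrow_candidates(candidates, features, answer)
        if len(candidates) == 1:
            return count, candidates
    return max_pictures, candidates


count, remaining = pictures_until_unique(n=1024, secret=7, rng=random.Random(0))
print(count, remaining)  # count is typically close to log2(1024) = 10
```

The secret feature can never be filtered out, since a 'yes' picture always contains it and a 'no' picture never does; only the other candidates shrink.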

We now present protocols based on this main idea. Each protocol is followed by a short discussion and a brief account of the adversary's workload. The detailed analysis follows in Sections VI and VII.

4 Identification Schemes 

Before presenting the proposed schemes, we introduce the general notation used throughout this manuscript. We assume a pool V of distinct pictures, each element of which is denoted by P, and a set Q of questions. Each question q has a binary answer when applied to any picture in V; that is, each question asks whether a certain feature is present in the picture or not. We also use q to denote the function q : V → {0, 1}, which represents the evaluation of a picture according to the question. The user's answer string is represented by A, where A(i) denotes the i-th bit of the string. We use a to denote an arbitrary answer bit. From now on, the word "adversary" or the symbol H denotes the "human" eavesdropping adversary, unless otherwise specified. The workload of H for each protocol is based on the complexity of the algorithm mentioned above (or a slight variant of it), which is described in Section VI. Because of the large amount of notation used in this article, we occasionally abuse this general notation in some sections or subsections without introducing ambiguity.

4.1 The Basic Scheme 

We have the following immediate basic scheme: 

Setup. The user and the server share a secret question q from Q. 

Protocol. 

- Repeat k times:

• The server picks a bit b uniformly at random and picks a picture P from V such that q(P) = b. It discards this picture from the pool and presents it to the user.

• The user submits a = q(P).

- Output accept if all answers are correct; otherwise output reject.
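A minimal simulation of one session of the basic scheme is given below, assuming pictures are represented by frozensets of feature labels and the secret question q is an ordinary predicate. The function run_basic_protocol and the toy picture pool are illustrative, not part of the scheme's specification.

```python
import random


def run_basic_protocol(q, pool, respond, k, rng):
    """Basic scheme: k rounds; each shows an unused picture P with q(P) = b for a random bit b."""
    pool = list(pool)  # copy, so the caller's pool is not modified
    for _ in range(k):
        b = rng.randrange(2)
        matching = [p for p in pool if q(p) == b]
        if not matching:
            raise ValueError(f"picture pool exhausted for answer bit {b}")
        picture = rng.choice(matching)
        pool.remove(picture)  # discard the picture so it is never shown again
        if respond(picture) != b:
            return "reject"
    return "accept"


def q(picture):
    """Secret question: 'Does the picture contain a basketball?'"""
    return 1 if "basketball" in picture else 0


pool = ([frozenset({"basketball", f"court {i}"}) for i in range(10)]
        + [frozenset({"tree", f"park {i}"}) for i in range(10)])
print(run_basic_protocol(q, pool, respond=q, k=10, rng=random.Random(1)))  # prints: accept
```

The honest user, whose respond function is q itself, is always accepted here: with 10 pictures of each kind, 10 rounds can never exhaust either half of the pool.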

The scheme is described pictorially in Figure 2, where 2 pictures are shown at each iteration. We could present all k pictures at the same time, depending on whether they fit on the screen. In any case, we can find a number L such that there are ⌈k/L⌉ iterations in one authentication session.

Fig. 2. The Basic Protocol with L = 2 and secret question q = "Does the picture contain a basketball?"

According to the analysis in Section VI, the number of pictures the adversary has to observe to obtain the secret feature is log_2 n, regardless of the value of L. This amounts to a total work of (n/2) log_2 n, which has to be done by the human adversary. Setting n = 10^6, we get a total work of ≈ 2^23 units.
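As a quick check of the arithmetic, the sketch below evaluates the basic-protocol workload, reading the factor in front of log_2 n as n/2: log_2 n observed pictures, each carrying about n/2 features that the human must extract. This reading is our interpretation of the formula, and it reproduces the stated ≈ 2^23 for n = 10^6.

```python
import math


def basic_protocol_work(n):
    """About log2(n) observed pictures, each with about n/2 features to extract by hand."""
    pictures_needed = math.log2(n)
    features_per_picture = n / 2
    return pictures_needed * features_per_picture


work = basic_protocol_work(10**6)
print(f"{work:.3g} feature evaluations, about 2^{math.log2(work):.1f}")
# prints: 9.97e+06 feature evaluations, about 2^23.2
```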

Discussion on the protocol. This protocol is simple and practical. However, it has a big disadvantage: the user sends its answers in the clear. An adversary thus knows the "correct" answer to every picture shown to the user. This makes the adversary's job easier, since it can simply glance at the pictures to find the common secret feature. We would like to somehow "hide" the answer sequence, making it hard for the adversary to guess the secret feature. The next protocol, which is the original protocol presented in [1], attempts to do that.

Matrix Interpretation. Consider the set of all features S = {1, 2, ..., n} as described in the previous section. Each picture contains a subset of this universal set of features. The secret question can also be considered as a subset of this set with a single element. We can represent the features of a presented picture i as a binary vector v_i, in which features present in the picture are represented by 1s and features not present by 0s, and the secret question as a vector x with exactly one entry equal to 1. We can then write the answer bit as v_i · x = a_i. With more pictures, we can represent the protocol as the matrix equation Vx = a, where the i-th row of the matrix V is the feature vector v_i of picture i, and a is the answer vector. The obvious way to solve this requires O(n) pictures and corresponding answers. However, as we will see in Section VI, not all pictures contain features from the full feature space, so the actual number of picture and answer pairs required is smaller. The real difficulty of the problem, however, is to find the features in the pictures and thus construct the correct matrix V. Our matrix representation is one way of describing the problem that H needs to solve.
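The matrix view can be made concrete with a small sketch, assuming features are indexed 0..n−1 (rather than 1..n) and pictures are given as sets of feature indices. Because x is a unit vector e_j, Vx is simply column j of V, so solving Vx = a amounts to finding the columns of V that equal a. The function names are hypothetical.

```python
def build_feature_matrix(pictures, n):
    """Row i is the 0/1 feature vector v_i of picture i over features 0..n-1."""
    return [[1 if f in picture else 0 for f in range(n)] for picture in pictures]


def one_hot_solutions(V, a):
    """All j such that V e_j = a, i.e. column j of V equals the answer vector a."""
    n = len(V[0])
    return [j for j in range(n) if all(row[j] == bit for row, bit in zip(V, a))]


pictures = [{0, 2}, {1, 2}, {2, 3}]
V = build_feature_matrix(pictures, n=4)
a = [1, 1, 1]                      # the user's answers when the secret feature is 2
print(one_hot_solutions(V, a))     # [2]
```

With too few pictures, several columns may coincide with a, and the returned list then contains every still-consistent candidate. The costly part for H remains filling in V correctly in the first place.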

4.2 The Enhanced Protocol 

Suppose we want to present L pictures at a time to the user. The pictures are labeled from 1 to L in sequential order. The idea is to permute these labels randomly, keep l of the labels in the resulting permutation, and replace the others with "don't care positions". This permutation string is also shared as a secret between the user and the server. Each time, a series of L pictures is presented. The user answers the pictures in the order given by the permutation string and fills the "don't care positions" with random bits. For example, if L = 10 and l = 5, a possible permutation string would be **54*39**0, where "*" represents a don't care position. We denote such a permutation string by σ. We now have two secrets in our scheme: the secret question q and the permutation string σ. The user thus answers each series of pictures according to the function q_σ, which maps a series of L pictures to an L-bit answer string. The user is accepted if the answers are correct at the l positions that are not don't care positions, in each of the k iterations. The rest of the protocol is the same as the basic protocol.
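The user's side of the enhanced protocol can be sketched as follows, assuming σ is stored as a list whose entries are 1-based picture labels or None for a don't care position. The helper parse_sigma only handles single-digit labels (L ≤ 9), and the toy predicate stands in for the secret question. All names here are illustrative.

```python
import random


def parse_sigma(s):
    """'2*43*' -> [2, None, 4, 3, None]; None marks a don't care position."""
    return [None if ch == "*" else int(ch) for ch in s]


def enhanced_answer(sigma, pictures, q, rng):
    """Position j carries q(picture sigma[j]); don't care positions get random bits."""
    return [q(pictures[label - 1]) if label is not None else rng.randrange(2)
            for label in sigma]


def server_accepts(sigma, pictures, q, answer):
    """Accept iff the answer is correct at the l positions that are not don't care."""
    return all(answer[j] == q(pictures[label - 1])
               for j, label in enumerate(sigma) if label is not None)


COMPUTING = {"cpu", "abacus", "laptop"}   # hypothetical stand-in for the secret concept


def q(picture):
    """Toy version of 'Is the picture somehow related to computation?'"""
    return 1 if picture & COMPUTING else 0


pictures = [{"cpu"}, {"tree"}, {"abacus"}, {"cat"}, {"laptop"}]   # labels 1..5
sigma = parse_sigma("2*43*")
answer = enhanced_answer(sigma, pictures, q, random.Random(0))
print(answer, server_accepts(sigma, pictures, q, answer))
```

Only the l non-"*" positions are checked by the server; the remaining L − l bits are pure noise to an observer who does not know σ.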

The protocol is illustrated in Figure 3. Here we have taken L = 5, l = 3 and k = 1. The permutation string is 2*43* and the secret question is "Is the picture somehow related to computation?".

Fig. 3. The Enhanced Protocol with L = 5, l = 3, k = 1 and secret question q = "Is the picture somehow related to computation?"



In Section VI, we derive the total amount of work required by the adversary as a function of L, l and n. Putting in the values L = 10, l = 5 and n = 10^6, we get a total workload of ≈ 2^27, which is significantly higher than for the previous protocol. At this point the adversary has the secret question. But what about the secret permutation string σ? The adversary can successfully impersonate the legitimate user only if it knows σ. It turns out that once the secret question has been revealed, the adversary needs to observe only a handful of iterations to guess the permutation string with high probability. We show this in Section V.

Discussion on the protocol. Even though this protocol has an advantage over the basic protocol in that the correct answer sequence is shuffled, it seems hard to construct a method that hides the answer sequence completely without too much effort on the user's part. The shuffling not only adds an extra burden when answering the pictures, it also slows down the process. Secondly, both schemes proposed so far raise a big question of practicality: how can the pictures be generated automatically? We conjecture that it is hard to write a program that can extract all the features from given pictures. But what about the problem of automatically generating or finding images that satisfy a given question? This might not be possible for all questions. In the next scheme we try to construct a protocol that can generate instances automatically.

Matrix Interpretation. It would be interesting to know whether this protocol can also be given a matrix representation. We could again write Vx = a. However, in this case the answer sequence actually sent by the user is not a itself. Letting a denote the sent answer string, we can represent the protocol as Vx = p_σ(a), where p_σ(a) is a permutation operation on a, applied to L components at a time. However, some of the bits in a are still random. This makes the problem similar to the Learning Parity with Noise (LPN) problem underlying the HB protocol [7]. Again, the real problem in our scheme is to extract all features from the pictures and thus construct a correct matrix V.

4.3 A Practical Scheme 

In this scheme, we use the Gimpy CAPTCHA [17]. Gimpy works by picking several words from a dictionary, distorting the text of these words and presenting the result as an image to a human user. The idea is that current computer programs cannot reliably read the distorted text. We assume a dictionary of size N. The algorithm Gimpy(j) does the same thing as Gimpy, except that it takes a desired number j of words from the dictionary. Let L be a small positive integer, e.g., 11. The whole image screen is divided into L boxes. Let s and t be non-negative integers modulo L, kept secret between the user and the computer. An initial value x_0, another non-negative integer modulo L, is also kept secret. A secret question q is constructed from the dictionary words. For example, q could be: "Are there more than three words beginning with the letter 'B'?". Let Gimpy(q, j) be the algorithm that takes j words from the dictionary such that the resulting challenge satisfies the question q, and let Gimpy(¬q, j) be the one whose challenge does not satisfy q. Let Grid(L) be the procedure that concatenates L images (boxes) into one image in the form of a grid. The protocol is described as follows:

Setup. Randomly generate integers s, t and x_0 modulo a public integer L, and share them as a secret between the server and the user. Share a secret question q from Q.
Protocol.

- For i = 1 to k, do:

• Compute x_i = s·x_{i−1} + t mod L.

• Select a bit b uniformly at random, and apply Gimpy(q, j) to the x_i-th box if b = 1 and Gimpy(¬q, j) otherwise.

• For each of the remaining boxes, apply Gimpy(j).

• Apply Grid(L), and present the result as a challenge to the user.

• The user computes x_i = s·x_{i−1} + t mod L and submits a = q(x_i), where, abusing notation, x_i denotes the x_i-th box.

- Output accept if all answers are correct; otherwise output reject.
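A toy simulation of the practical scheme is sketched below. It skips the actual distortion and rendering: each box is simply a list of j = 2 words, boxes are numbered 0..L−1, and the stand-ins for Gimpy(q, j), Gimpy(¬q, j) and Gimpy(j) draw from a tiny hypothetical word list. It illustrates how the shared (s, t, x_0) keeps the server and the user synchronized on the secret box.

```python
import random

ANIMALS = ["bat", "cat", "cow", "dog", "owl"]   # hypothetical mini-dictionary
OTHERS = ["car", "cup", "map", "pen", "sky"]


def q(words):
    """Secret question: 'Does the picture contain the names of at least two animals?'"""
    return 1 if sum(w in ANIMALS for w in words) >= 2 else 0


def lcg_positions(s, t, x0, L, k):
    """x_i = s * x_{i-1} + t mod L for i = 1..k."""
    xs, x = [], x0
    for _ in range(k):
        x = (s * x + t) % L
        xs.append(x)
    return xs


def make_grid(L, secret_box, b, rng, j=2):
    """Box secret_box satisfies q iff b = 1; every other box is an unconstrained Gimpy(j)."""
    grid = []
    for box in range(L):
        if box == secret_box:
            words = rng.sample(ANIMALS, j) if b == 1 else rng.sample(OTHERS, j)
        else:
            words = rng.sample(ANIMALS + OTHERS, j)
        grid.append(words)
    return grid


def honest_session(s, t, x0, L, k, rng):
    """Server and user share (s, t, x0); returns True if every answer is correct."""
    for x in lcg_positions(s, t, x0, L, k):
        b = rng.randrange(2)
        grid = make_grid(L, x, b, rng)
        if q(grid[x]) != b:   # the user evaluates only box x_i
            return False
    return True


print(lcg_positions(3, 3, 5, 4, 1))                      # [2], as in Figure 4
print(honest_session(3, 3, 5, 4, 20, random.Random(0)))  # True
```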

The protocol is illustrated in Figure 4. Here L = 4, s = 3, t = 3 and x_0 = 5. Thus x_1 = (3·5 + 3) mod 4 = 2, so the user looks at the picture labeled 2 and answers the question q = "Does the picture contain the names of at least two animals?". The pictures are taken from the Gimpy webpage [17].

Fig. 4. The Protocol Scheme with L = 4, s = 3, t = 3, x_0 = 5 and secret question q = "Does the picture contain the names of at least two animals?"

What about the total amount of work required by the adversary? This protocol has the form of the previous protocol if we keep L and set l = 1. Thus the total amount of work required by the adversary is given by the expression for the enhanced protocol with l = 1, where the logarithmic factor becomes log_2(n(N)). The notation n(N) indicates that the total number of features (possible questions) is a function of the size N of the dictionary.

Discussion on the protocol. The main idea of the protocol is to make the human adversary write down all the words presented in the image. Notice the use of the linear congruential generator x_i = s·x_{i−1} + t mod L, for small values of L and hence of s and t. It is used to introduce randomness into the selection of the secret box. Although this is not a cryptographically strong pseudorandom number generator, certainly not for small values of L, it is used only to inject some randomness. Since the values of x_i are not shown in the clear, it is safe enough for our purposes. Notice that for some values of the parameters s, t and x_0, the x_i's do not span the whole set of integers modulo L. This again is not a concern, as the adversary does not know which parameters were chosen. The most important concern, however, is: how much do distorted images of text and numbers describe? Unfortunately, the range of features described by a distorted image of dictionary words is much narrower than that of a natural image. Therefore, although the above scheme brings some automation to the challenge generation process, it is not as secure as the schemes based on natural images in terms of the workload on the adversary. Details follow in Section VI.
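The remark that some parameter choices do not cover all boxes can be checked directly. The sketch below enumerates the boxes visited by the generator and compares the result with the Hull–Dobell theorem, a standard result for linear congruential generators that we bring in here (it is not discussed in the text). According to that theorem, the sequence visits all L residues from every seed if and only if gcd(t, L) = 1, s − 1 is divisible by every prime factor of L, and s − 1 is divisible by 4 whenever 4 divides L. For a prime L such as 11, this means full coverage only when s = 1 and t ≠ 0.

```python
from math import gcd


def lcg_orbit(s, t, x0, L):
    """Boxes visited by x_i = s * x_{i-1} + t mod L (i >= 1) before the first repeat."""
    seen, x = [], x0 % L
    while True:
        x = (s * x + t) % L
        if x in seen:
            return seen
        seen.append(x)


def prime_factors(m):
    """Set of prime factors of m >= 2, by trial division."""
    factors, p = set(), 2
    while p * p <= m:
        while m % p == 0:
            factors.add(p)
            m //= p
        p += 1
    if m > 1:
        factors.add(m)
    return factors


def has_full_period(s, t, L):
    """Hull-Dobell conditions for the generator to visit all L boxes from any seed."""
    if gcd(t, L) != 1:
        return False
    if any((s - 1) % p != 0 for p in prime_factors(L)):
        return False
    return L % 4 != 0 or (s - 1) % 4 == 0


print(lcg_orbit(3, 3, 5, 4))        # Figure 4 parameters: [2, 1], only two of four boxes
print(len(lcg_orbit(2, 3, 0, 11)))  # 10: box 8 is a fixed point and is never reached
```

Since the adversary does not know (s, t, x_0), such gaps in coverage do not by themselves reveal the secret box, which matches the argument in the text.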



Matrix Interpretation. Once again, we are tempted to use a matrix representation for the protocol. We could represent it as p(V)x = a, where V represents the matrix of features of all L pictures in every iteration, and p(V) represents the function that picks one of those L pictures according to the chosen parameters of the linear congruential generator. Thus this protocol is the opposite of the previous one, in the sense that instead of shuffling the answer string, it pseudorandomly chooses which picture is answered each time.



4.4 Using Multiple Questions 

In [1] we proposed that the secret can be a group of questions connected by any combination of logical connectives, such as AND, OR and NOT. However, we should not connect more than a small number of features, because otherwise the workload on the legitimate user increases considerably; a group of three or fewer questions seems safe. The adversary's algorithm described above and in the analysis does not work in this case, unless all the questions are connected by logical AND. There are still ways around this: the adversary now looks for features that are inconsistent with the answers and eliminates them in each iteration. More precisely, let V denote the matrix of features extracted from the pictures, where each column corresponds to a feature and each entry indicates the presence or absence of that feature in the corresponding picture. If the adversary also knows the answer vector a, then its task is to find a boolean function of three or fewer literals that is consistent with this mapping from the rows of V to the bits of a. Multiple questions can be used in any of the three protocols above. For the basic protocol, the problem reduces trivially to finding the boolean function defined above, and the adversary can eliminate inconsistent features easily, as the protocol involves no random replies. The enhanced protocol is trickier, since the adversary does not know the exact evaluation of the expression and the answer sequence contains random bits as well; using it with questions connected by logical operators therefore makes the adversary's task harder. The practical scheme also becomes hard, as the adversary does not know the literals used in the evaluation of the boolean function.
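A brute-force version of this consistency check is sketched below, assuming H has already filled in V (rows are pictures, columns are features) and, as in the basic protocol, sees the true answer vector a. It relies on a simple observation: some boolean function of a chosen set of columns (built from AND, OR and NOT in any way) explains a if and only if no two pictures agree on those columns but receive different answers. The function name and the limit of three features are illustrative.

```python
from itertools import combinations, product


def consistent_feature_sets(V, a, max_features=3):
    """Feature subsets (size <= max_features) on which some boolean function maps rows of V to a."""
    n = len(V[0])
    result = []
    for size in range(1, max_features + 1):
        for cols in combinations(range(n), size):
            seen = {}            # restricted row -> answer bit observed for it
            consistent = True
            for row, bit in zip(V, a):
                key = tuple(row[c] for c in cols)
                if seen.setdefault(key, bit) != bit:
                    consistent = False   # same feature pattern, different answers
                    break
            if consistent:
                result.append(cols)
    return result


# Toy check: 4 features, all 16 pictures, secret = "feature 0 AND NOT feature 1".
V = [list(bits) for bits in product([0, 1], repeat=4)]
a = [row[0] & (1 - row[1]) for row in V]
print(consistent_feature_sets(V, a))   # [(0, 1), (0, 1, 2), (0, 1, 3)]
```

Supersets of a consistent set are always consistent, so H would look for the smallest ones. The search examines roughly n^3/6 column subsets, which is cheap once the features are known; the expensive step remains extracting them.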



5 Experiments 



We conducted a few experiments in order to get an idea of the efficiency of our scheme. The experimental stage consisted of two main experiments. The first was carried out to see how many distinct features can be extracted from a given picture; the second was to check whether it is easy for a human user to tell whether a given feature is present in a picture or not. For the first experiment, we presented the image in Figure 1 to ten participants (all computer science graduate students). Each of them was asked to write down as many features as they believed the picture contained. The participants extracted a cumulative total of 42 distinct features, not counting multiplicities (for example, the digits 1, 2, ..., 9 are counted as one feature). These features are given in Table 1 along with their frequency, i.e., the number of participants who wrote down the corresponding feature:



Table 1. Experimental Results (Freq: number of participants who listed the feature; Res: responses of the three evaluators, ✓ = present, X = absent)

| Feature | Freq | Res | Feature | Freq | Res |
|---|---|---|---|---|---|
| Numbers (Digits 1-9) | 5 | ✓✓✓ | Digits without closed loop(s) | 1 | ✓✓✓ |
| Black color | 4 | ✓✓✓ | Digits with closed loop(s) | 1 | ✓✓✓ |
| Columns and rows sum to 15 | 3 | ✓✓✓ | A heart | 1 | ✓X✓ |
| Diagonal sums to 15 | 3 | ✓✓✓ | Sign board | 1 | ✓XX |
| Square(s) | 3 | ✓✓✓ | Triangle | 1 | ✓X✓ |
| Matrix | 3 | ✓✓✓ | Cross(es) | 1 | ✓X✓ |
| 3X3 matrix | 3 | ✓✓✓ | Line(s) | 1 | ✓✓✓ |
| White color | 2 | ✓✓✓ | Rectangle(s) | 1 | ✓✓X |
| Magic square | 2 | ✓✓✓ | Stair(s) | 1 | ✓✓✓ |
| Odd number(s) | 2 | ✓✓✓ | '+' sign | 1 | ✓✓✓ |
| Black line(s) | 2 | ✓✓✓ | Circle(s) | 1 | ✓✓✓ |
| Hook | 1 | | Zig zag path | 1 | ✓X✓ |
| Slide | 1 | | Alphabets C,S,L and | 1 | ✓X✓ |
| 'X' sign | 1 | ✓XX | The string 492357816 | 1 | ✓✓✓ |
| The string 438951276 | 1 | ✓✓✓ | Distortion in slanted line(s) | 1 | ✓X✓ |
| Table | 1 | ✓X✓ | Array | 1 | ✓✓✓ |
| Balance | 1 | XXX | Equilibrium | 1 | ✓XX |
| Symmetry | 1 | ✓✓✓ | Complements | 1 | ✓✓X |
| Typed digit(s) | 1 | ✓✓✓ | White area > Black area | 1 | ✓✓✓ |
| Even number(s) | 1 | ✓✓✓ | The right angle | 1 | ✓✓✓ |
| Mathematics | 1 | ✓✓X | Arithmetic | 1 | ✓✓X |


Once the features were collected, they were shown to three individuals who had not taken part in the first experiment. They were shown the picture and asked whether each feature found by the participants in the first experiment was present in the picture or not. Their responses are shown under "Res" in Table 1, where a ✓ indicates that the corresponding evaluator believed the feature to be present in the picture. Not surprisingly, the features with higher frequencies were answered correctly by all three evaluators. On the other hand, some of the single-frequency features were also answered correctly by all three. The features with mixed answers are those that require a 'keen' eye, e.g., "hook". These experiments show that even a picture as simple as the one shown in Figure 1 can have many features, the majority of which are very easy to answer but not so easy to extract. There may still be many more features present in the picture; one such example is "one side of a Rubik's cube". This survey gives us some guidelines for choosing the pictures and/or secret questions:

- Do not use pictures whose main object is the secret feature. For example, if we chose the picture of Figure 1 as a challenge picture, then the secret question "Does the picture contain the digits 1-9?" would certainly be a bad choice.

- Do not use simple pictures. Simple pictures contain relatively few features, as is evident from Figure 1: although one may still think of more features, their number does not seem to be large.

- Do not use secret questions that are hard for the legitimate user to answer. For example, the feature "Equilibrium" in the above table was answered "no" by two users. It seems hard to spot in the picture and requires more of a philosophical eye.

- Always allow for user error. For example, if the user answers 10 pictures, allow 2 to 3 wrong answers. The need for this is clear from Table 1, where a user answered "no" to the feature "Mathematics", even though it seems to describe the figure.
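Tolerating errors has a price that is easy to quantify for the basic protocol: an impostor who answers each picture by a fair coin flip passes if it makes at most e mistakes among k answers. The sketch below computes that probability under this simple independent-guess assumption; it ignores any partial knowledge the impostor may have.

```python
from math import comb


def guess_pass_probability(k, max_errors):
    """P[at most max_errors wrong among k independent fair-coin answers]."""
    return sum(comb(k, e) for e in range(max_errors + 1)) / 2 ** k


for e in range(4):
    print(e, round(guess_pass_probability(10, e), 4))
# prints: 0 0.001 / 1 0.0107 / 2 0.0547 / 3 0.1719 (one pair per line)
```

With k = 10, allowing 2 errors raises a random guesser's success probability from about 0.1% to about 5.5%, and allowing 3 raises it to about 17%, so a larger k is needed to compensate for the tolerance.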

6 Analysis of the Enhanced Protocol 

In this section, we analyze the relationship between the hidden permutation σ and the secret question q in the enhanced protocol. The purpose of the analysis is to determine the strength of the protocol if one of these secrets is leaked. First, assume that the adversary has somehow learned the hidden permutation σ but not the secret question. Since the adversary knows the hidden permutation, it knows exactly which pictures are being answered according to the secret question and which ones are being answered randomly. Thus the adversary can ignore the randomly answered pictures and use the pictures with correct answers to search for the secret question. This reduces the problem to the Basic Protocol in a straightforward manner.

Now consider the other direction: assume an adversary H who knows the secret question q. It evaluates each picture for up to k iterations. Let X_i(t) be the variable representing the bit evaluated for the ith picture in the tth iteration of the enhanced protocol, where 1 ≤ t ≤ k. After k iterations, H thus has the following information:

$$\begin{array}{cccc}
X_1(t) & X_2(t) & \cdots & X_L(t)\\ \hline
b_{11} & b_{12} & \cdots & b_{1L}\\
b_{21} & b_{22} & \cdots & b_{2L}\\
\vdots & \vdots & \ddots & \vdots\\
b_{k1} & b_{k2} & \cdots & b_{kL}
\end{array}$$

where each b_{ti} is the bit evaluated by H for the ith picture in the tth iteration. H also has the following response table from the legitimate user:

$$\begin{array}{cccc}
Y_1(t) & Y_2(t) & \cdots & Y_L(t)\\ \hline
a_{11} & a_{12} & \cdots & a_{1L}\\
a_{21} & a_{22} & \cdots & a_{2L}\\
\vdots & \vdots & \ddots & \vdots\\
a_{k1} & a_{k2} & \cdots & a_{kL}
\end{array}$$

where each Y_i(t) represents the user's response bit at position i in the tth iteration. The adversary now runs the following simple algorithm:

- Initialize σ(·) = null.

- For each X_i(t), check whether there exists a Y_j(t) such that the two match in every iteration t = 1, ..., k.

  • If there is exactly one such Y_j(t), set σ(j) = i.

  • If there is more than one such Y_j(t), halt.

- Assign * to each unassigned position of σ(·).

- Output the permutation σ and halt.
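The matching step above can be sketched in Python as follows. This is an illustration under our own conventions, not the authors' implementation: each bit column is a plain list of k bits, indices are 0-based, `'*'` marks don't-care positions, `None` stands for the "halt" outcome, and the name `recover_sigma` is hypothetical.

```python
from typing import List, Optional, Sequence, Union


def recover_sigma(X: Sequence[Sequence[int]],
                  Y: Sequence[Sequence[int]]) -> Optional[List[Union[int, str]]]:
    """Matching attack by an adversary who knows the secret question q.

    X[i] -- the k bits the adversary evaluated for picture i (one per iteration)
    Y[j] -- the k bits the user sent at answer position j (one per iteration)
    Returns sigma with sigma[j] = i, '*' for unassigned (don't-care) positions,
    or None if some X[i] matches more than one Y[j] (the algorithm halts).
    """
    sigma: List[Union[int, str]] = ['*'] * len(Y)   # unassigned positions stay '*'
    for i, xi in enumerate(X):
        # Response columns that agree with picture i in every iteration.
        matches = [j for j, yj in enumerate(Y) if list(yj) == list(xi)]
        if len(matches) > 1:
            return None                              # ambiguous: halt
        if len(matches) == 1:
            sigma[matches[0]] = i
    return sigma


X = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
Y = [[0, 1, 1], [1, 1, 1], [1, 0, 1]]
print(recover_sigma(X, Y))   # [1, '*', 0]
```

If two different pictures happen to match the same response column, this sketch keeps the later picture; the algorithm in the text does not address that case.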

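We now state the following theorem.

Theorem 1.

$$\left[\frac{l}{L}\left(1-\frac{1}{2^{k}}\right)^{L-1}+\left(1-\frac{l}{L}\right)\left(1-\frac{1}{2^{k}}\right)^{L}\right]^{L} \le \Pr[\sigma \text{ is correct}] \le \left(1-\frac{L-l}{L\,2^{k}}\right)^{L}$$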

Proof. Let A_i be the event that the adversary correctly guesses the ith position of the permutation σ. Without loss of generality, we assume that the adversary starts with the leftmost position and proceeds to the next position in sequential order. We have to find the probability:

$$\Pr[\sigma \text{ is correct}] = \Pr[A_1]\,\Pr[A_2 \mid A_1]\cdots\Pr[A_L \mid A_1 \wedge A_2 \wedge \cdots \wedge A_{L-1}]$$

Now let B_i be the event that position i is not a don't-care position, and let \bar{B}_i be the complementary event. It is clear that:

$$\Pr[A_1] = \Pr[A_1 \mid B_1]\,\Pr[B_1] + \Pr[A_1 \mid \bar{B}_1]\,\Pr[\bar{B}_1]$$




Fig. 5. Guessing Probability (plotted against the number of iterations)



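It is easy to see that:

$$\Pr[B_1] = \frac{l}{L} \quad\text{and}\quad \Pr[\bar{B}_1] = 1-\frac{l}{L}$$

If B_1 is true, then X_1(t) matches at least one of the Y_j(t)'s. The adversary's algorithm guesses it correctly if there is only one such Y_j(t). Since each of the other L − 1 response columns agrees with X_1(t) in all k iterations with probability 1/2^k, we get:

$$\Pr[A_1 \mid B_1] = \Pr[\text{there exists exactly one } j \text{ such that } Y_j(t) = X_1(t)] = \underbrace{\left(1-\frac{1}{2^{k}}\right)\cdots\left(1-\frac{1}{2^{k}}\right)}_{L-1 \text{ times}} = \left(1-\frac{1}{2^{k}}\right)^{L-1}$$

Now if \bar{B}_1 is true, the algorithm detects this if X_1(t) does not match any of the Y_j(t)'s. The probability of this is:

$$\Pr[A_1 \mid \bar{B}_1] = \underbrace{\left(1-\frac{1}{2^{k}}\right)\cdots\left(1-\frac{1}{2^{k}}\right)}_{L \text{ times}} = \left(1-\frac{1}{2^{k}}\right)^{L}$$

With this we get:

$$\Pr[A_1] = \frac{l}{L}\left(1-\frac{1}{2^{k}}\right)^{L-1} + \left(1-\frac{l}{L}\right)\left(1-\frac{1}{2^{k}}\right)^{L}$$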
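Both conditional probabilities rest on modelling every response column other than X_1(t)'s own as a uniformly random k-bit string, independent of the others. Under that assumption (which we make explicit here), a quick Monte Carlo sketch reproduces the two formulas; the function and parameter names are hypothetical.

```python
import random


def estimate_match_probability(L: int, k: int, b1_true: bool,
                               trials: int = 20000, seed: int = 1) -> float:
    """Monte Carlo estimate of Pr[A_1 | B_1] (b1_true=True) or Pr[A_1 | not B_1].

    If B_1 holds, one response column equals X_1 itself, so only the other
    L - 1 columns must differ from X_1; otherwise all L columns must differ.
    Every competing column is modelled as a uniform random k-bit string.
    """
    rng = random.Random(seed)
    competitors = L - 1 if b1_true else L
    successes = 0
    for _ in range(trials):
        x1 = rng.getrandbits(k)  # X_1 over the k iterations, packed into an int
        if all(rng.getrandbits(k) != x1 for _ in range(competitors)):
            successes += 1
    return successes / trials


L, k = 5, 3
miss = 1 - 2 ** -k  # a single random column differs from X_1
print(estimate_match_probability(L, k, True), miss ** (L - 1))
print(estimate_match_probability(L, k, False), miss ** L)
```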



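Let us now calculate the probability Pr[A_L | A_1 ∧ A_2 ∧ ... ∧ A_{L-1}]. This is equal to:

$$\Pr[A_L \mid A_1 \wedge \cdots \wedge A_{L-1}] = \Pr[A_L \mid B_L \wedge A_1 \wedge \cdots \wedge A_{L-1}]\,\Pr[B_L \mid A_1 \wedge \cdots \wedge A_{L-1}] + \Pr[A_L \mid \bar{B}_L \wedge A_1 \wedge \cdots \wedge A_{L-1}]\,\Pr[\bar{B}_L \mid A_1 \wedge \cdots \wedge A_{L-1}]$$

It is straightforward to see that:

$$\Pr[B_L \mid A_1 \wedge \cdots \wedge A_{L-1}] = \frac{l}{L} \quad\text{and}\quad \Pr[\bar{B}_L \mid A_1 \wedge \cdots \wedge A_{L-1}] = 1-\frac{l}{L}$$

We also have:

$$\Pr[A_L \mid B_L \wedge A_1 \wedge \cdots \wedge A_{L-1}] = 1$$

and

$$\Pr[A_L \mid \bar{B}_L \wedge A_1 \wedge \cdots \wedge A_{L-1}] = 1-\frac{1}{2^{k}}$$

We get the final result:

$$\Pr[A_L \mid A_1 \wedge \cdots \wedge A_{L-1}] = \frac{l}{L} + \left(1-\frac{l}{L}\right)\left(1-\frac{1}{2^{k}}\right) = 1-\frac{L-l}{L\,2^{k}}$$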




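Note that all other conditional probabilities for the events A_2, A_3, ..., A_{L-1} must lie between these two calculated probabilities. This gives us the following lower and upper bounds for the probability of guessing the correct permutation:

$$\left[\frac{l}{L}\left(1-\frac{1}{2^{k}}\right)^{L-1}+\left(1-\frac{l}{L}\right)\left(1-\frac{1}{2^{k}}\right)^{L}\right]^{L} \le \Pr[\sigma \text{ is correct}] \le \left(1-\frac{L-l}{L\,2^{k}}\right)^{L}$$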
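To see how quickly these bounds approach 1 as the adversary observes more iterations (the quantity plotted in Fig. 5), the sketch below simply evaluates the two expressions above. The parameter values L = 20 and l = 10 are illustrative choices of ours, and `sigma_guess_bounds` is a hypothetical name. Note that Pr[A_1] equals (1 − 2^{-k})^{L-1} times the per-position term of the upper bound, so the lower bound can never exceed the upper one.

```python
from typing import Tuple


def sigma_guess_bounds(L: int, l: int, k: int) -> Tuple[float, float]:
    """Lower and upper bounds of Theorem 1 on Pr[sigma is correct] after k iterations."""
    miss = 1 - 2.0 ** -k                                          # a random column differs from X_i
    p_first = (l / L) * miss ** (L - 1) + (1 - l / L) * miss ** L  # Pr[A_1]
    p_last = 1 - (L - l) / (L * 2 ** k)                           # Pr[A_L | A_1 ... A_{L-1}]
    return p_first ** L, p_last ** L


for k in (4, 8, 12, 16):
    low, high = sigma_guess_bounds(L=20, l=10, k=k)
    print(f"k={k:2d}  lower={low:.4f}  upper={high:.4f}")
```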



7 Obtaining the Secret Question 

Let S = {1, 2, ..., n} denote the universal set of all features. Let A_1 and A_2 denote two subsets of this set. In practice, A_1 and A_2 denote the sets of features of two pictures drawn randomly according to an arbitrary distribution. We assume that any feature i in S is equally likely to occur in any of the subsets drawn. We define the following two indicator variables:

$$I_i = \begin{cases} 1 & \text{if } i \in A_1 \\ 0 & \text{otherwise} \end{cases} \qquad J_i = \begin{cases} 1 & \text{if } i \in A_2 \\ 0 & \text{otherwise} \end{cases}$$

Then we have $|A_1 \cap A_2| = \sum_{i=1}^{n} I_i J_i$. The number of subsets of S containing a given feature i is 2^{n-1}. Since each subset of S is equally likely to contain i regardless of the distribution with which the subset is drawn, we have:

$$\Pr[A_1 \text{ contains } i] = \Pr[A_2 \text{ contains } i] = 2^{n-1}/2^{n} = 1/2$$

From this and the fact that the two subsets are drawn independently of each other, we have:

$$E[|A_1 \cap A_2|] = \sum_{i=1}^{n} E[I_i J_i] = \sum_{i=1}^{n} E[I_i]\,E[J_i]$$

Now, $E[I_i] = \sum_{A_1 \subseteq S,\; i \in A_1} \Pr[A_1 \text{ is chosen from } S] = 2^{n-1}/2^{n} = 1/2$. Similarly, E[J_i] = 1/2. Finally, this gives us:

$$E[|A_1 \cap A_2|] = \sum_{i=1}^{n} E[I_i]\,E[J_i] = n/4$$

In general we can see that:

$$E[|A_1 \cap A_2 \cap \cdots \cap A_t|] = \frac{n}{2^{t}} = \frac{1}{2}\,E[|A_1 \cap A_2 \cap \cdots \cap A_{t-1}|]$$
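The n/2^t result is easy to check empirically when each picture is modelled, as in the derivation, as a uniformly random subset of S (each feature present independently with probability 1/2). The sketch below is such a check under that modelling assumption; `mean_intersection` is a hypothetical name.

```python
import random


def mean_intersection(n: int, t: int, trials: int = 2000, seed: int = 7) -> float:
    """Average |A_1 ∩ ... ∩ A_t| over `trials` draws of t uniform random subsets of S."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        common = set(range(n))
        for _ in range(t):
            subset = {i for i in range(n) if rng.random() < 0.5}  # uniform subset of S
            common &= subset
        total += len(common)
    return total / trials


n = 64
for t in (1, 2, 3):
    print(t, mean_intersection(n, t), n / 2 ** t)
```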

We now analyze the three protocols using the result obtained above. According to our discussion in Section III, the adversary will try to narrow down the number of possible secret features using the algorithm GetBasicQ described in Section III.

7.1 The Basic Protocol 

The adversary H looks at the current picture and the answer given by the user, and performs the procedure GetBasicQ described in Algorithm 1. Note that "Compute A_k" means extracting the features of picture A_k.



Algorithm 1 GetBasicQ

Input: A set of pictures A_1, A_2, ... together with their answers a_1, a_2, ...
Output: The secret feature q

1: if the answer bit is 0 then
2:   Wait for the next iteration.
3: else
4:   Extract the features in the picture as A_1
5:   repeat
6:     For each picture k:
7:     if a_k = 1 (the answer bit at position j) then
8:       Compute A_k (the picture at position i) and assign A_1 ← A_1 ∩ A_k.
9:     else
10:      As a_k = 0 (the answer bit at position j), compute A_k (the picture at position i) and assign A_1 ← A_1 − A_k.
11:    end if
12:  until |A_1| = 1, then halt.
13: end if
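A minimal Python sketch of GetBasicQ, assuming the extracted features of each picture are already available as a set (feature extraction itself is the human work being counted). The representation, the `None` result when no picture is answered 1, and the name `get_basic_q` are our own choices; the comments refer to the numbered lines of Algorithm 1.

```python
from typing import Hashable, Iterable, Optional, Set, Tuple


def get_basic_q(observations: Iterable[Tuple[Set[Hashable], int]]) -> Optional[Set[Hashable]]:
    """Narrow the candidate features down to the secret q.

    `observations` yields (features of picture A_k, answer bit a_k) in order.
    Returns the candidate set once it has a single element, the remaining
    candidates if the input runs out first, or None if no picture was answered 1.
    """
    candidates: Optional[Set[Hashable]] = None
    for features, bit in observations:
        if candidates is None:
            if bit == 0:
                continue                  # lines 1-2: wait for a picture answered 1
            candidates = set(features)    # line 4: initial candidate set A_1
        elif bit == 1:
            candidates &= features        # line 8: A_1 <- A_1 intersect A_k
        else:
            candidates -= features        # line 10: A_1 <- A_1 minus A_k
        if len(candidates) == 1:
            break                         # line 12: |A_1| = 1
    return candidates


obs = [({"car", "tree"}, 0),
       ({"cat", "dog", "sun", "hat"}, 1),
       ({"dog", "hat", "moon"}, 0),
       ({"cat", "sun", "tree"}, 1),
       ({"sun", "car"}, 0)]
print(get_basic_q(obs))   # {'cat'}
```

Here the secret feature is "cat": every picture containing it is answered 1 and every other picture 0, so intersections and differences both preserve it while discarding the rest.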



Next, we compute the expected number of steps the adversary has to wait until the above algorithm halts.



Theorem 2. If H successfully extracts all the features, then the expected number of steps is $\log_2 n$.

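Proof. Let A_1^{(k)} denote the set A_1 after the kth step. We proceed by induction. First assume that k = 2. There are two cases. First, if a_2 = 1 then, as computed above:

$$E\left[\left|A_1^{(2)}\right|\right] = E\left[\left|A_1^{(1)} \cap A_2\right|\right] = n/2^{2}$$

And if a_2 = 0, we obtain the same result:

$$E\left[\left|A_1^{(2)}\right|\right] = E\left[\left|A_1^{(1)} - A_2\right|\right] = E\left[\left|A_1^{(1)} \cap A_2^{c}\right|\right] = n/2^{2}$$

This holds since A_2^c, the complement of A_2, is also a subset of S. Now suppose that for k = t we have:

$$E\left[\left|A_1^{(t)}\right|\right] = n/2^{t}$$

Then, if a_{t+1} = 1:

$$E\left[\left|A_1^{(t+1)}\right|\right] = E\left[\left|A_1^{(t)} \cap A_{t+1}\right|\right] = \frac{1}{2}\,E\left[\left|A_1^{(t)}\right|\right] = \frac{n}{2^{t+1}}$$

And, if a_{t+1} = 0, then again:

$$E\left[\left|A_1^{(t+1)}\right|\right] = E\left[\left|A_1^{(t)} - A_{t+1}\right|\right] = E\left[\left|A_1^{(t)} \cap A_{t+1}^{c}\right|\right] = \frac{1}{2}\,E\left[\left|A_1^{(t)}\right|\right] = \frac{n}{2^{t+1}}$$

The adversary will stop for some k if $E\left[\left|A_1^{(k)}\right|\right] = 1$; this means that:

$$\frac{n}{2^{k}} = 1 \implies k = \log_2 n$$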
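The following Monte Carlo sketch runs the narrowing loop of GetBasicQ on pictures modelled, as in the analysis, as uniformly random subsets of S, with feature 0 playing the role of the secret. These modelling choices and the function names are ours. The sketch counts the steps taken after the first picture answered 1. Because the proof tracks when the expected size of A_1^{(k)} reaches 1, rather than when the last spurious feature is eliminated, one should expect the simulated mean to sit slightly above log_2 n.

```python
import math
import random


def narrowing_steps(n: int, rng: random.Random) -> int:
    """Steps GetBasicQ needs after its first picture answered 1 before only
    the secret feature (feature 0) is left among the candidates."""
    candidates = None
    steps = 0
    while candidates is None or len(candidates) > 1:
        picture = {f for f in range(n) if rng.random() < 0.5}  # uniform subset of S
        bit = 1 if 0 in picture else 0                         # honest answer for secret 0
        if candidates is None:
            if bit == 1:
                candidates = picture                           # A_1^(1)
            continue
        steps += 1
        candidates = candidates & picture if bit == 1 else candidates - picture
    return steps


def mean_steps(n: int, trials: int = 400, seed: int = 3) -> float:
    rng = random.Random(seed)
    return sum(narrowing_steps(n, rng) for _ in range(trials)) / trials


n = 256
print(mean_steps(n), math.log2(n))
```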



Now, the number of iterations (pictures) the adversary has to observe is log_2 n. At each step the adversary has to extract the features of a picture, so the expected amount of work at each step is n/2, and the total amount of work done by the adversary in the basic protocol is $\frac{n}{2}\log_2 n$. What about the probability of success of the algorithm? We have assumed in this analysis that the adversary can extract all features in the image. In general, an adversary might not be able to extract everything in an image, including the secret feature. We can associate an average probability p with the extraction of the secret feature, meaning that the secret will be extracted with average probability p whenever the adversary is presented with a picture with answer '1'. The '1' instances occur with probability 1/2. Therefore, the average probability of success of the above algorithm is $p^{(\log_2 n)/2}$. In the special case where the probability of the secret feature being extracted is 1/2, the average probability of success would be $(1/2)^{(\log_2 n)/2} = 1/\sqrt{n}$.
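The two quantities above are simple to evaluate. The sketch below computes them for given n and p; the function names are hypothetical, and the values in the example (n = 10^5, p = 1/2) are only illustrative.

```python
import math


def basic_workload(n: int) -> float:
    """(n/2) * log2(n): about log2(n) pictures, n/2 features extracted per picture."""
    return n / 2 * math.log2(n)


def basic_success(n: int, p: float) -> float:
    """p ** (log2(n) / 2): the secret must be spotted in every picture answered '1'."""
    return p ** (math.log2(n) / 2)


n = 10 ** 5
print(basic_workload(n), basic_success(n, 0.5), 1 / math.sqrt(n))
```

The last two printed values coincide, matching the special case p = 1/2, where the success probability reduces to 1/√n.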



7.2 The Enhanced Protocol 



Now suppose the adversary H wants to find out the hidden question in the enhanced protocol. This time the adversary cannot use the simple procedure GetBasicQ it used for the basic protocol, because of the permutation σ. It has to be selective in its choices, and therefore uses a slightly modified version of GetBasicQ, called GetEnhancedQ, described in Algorithm 2.



Algorithm 2 GetEnhancedQ

Input: A set of more than L pictures A_1, A_2, ... together with an equal number of answer bits a_1, a_2, ...
Output: The secret feature q

1: Select a random picture position i between 1 and L.
2: Select a random answer position j between 1 and L.
3: if the answer bit at position j is 0 then
4:   Wait for the next iteration.
5: else
6:   Extract the features in the picture at position i as A_i
7:   repeat
8:     For each iteration k:
9:     if a_k = 1 then
10:      Compute A_k and assign A_i ← A_i ∩ A_k.
11:    else
12:      As a_k = 0, compute A_k and assign A_i ← A_i − A_k.
13:    end if
14:  until |A_i| = 1, then halt.
15: end if
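GetEnhancedQ is GetBasicQ applied to the stream obtained by pairing one guessed picture position i with one guessed answer position j in every iteration. The sketch below assumes each iteration is given as the feature sets of its L pictures plus the L answer bits; it uses 0-based positions, and treats an empty candidate set as a sign that the guessed pair was wrong. These conventions and the name `get_enhanced_q` are our own.

```python
import random
from typing import Optional, Sequence, Set, Tuple

# One iteration: (feature sets of the L pictures, the L answer bits sent by the user).
Round = Tuple[Sequence[Set[str]], Sequence[int]]


def get_enhanced_q(rounds: Sequence[Round], i: Optional[int] = None,
                   j: Optional[int] = None,
                   rng: Optional[random.Random] = None) -> Tuple[int, int, Optional[Set[str]]]:
    """Guess picture position i and answer position j (unless given), then run
    the GetBasicQ narrowing on the stream (picture i, answer bit j)."""
    rng = rng or random.Random()
    L = len(rounds[0][0])
    i = rng.randrange(L) if i is None else i        # line 1
    j = rng.randrange(L) if j is None else j        # line 2
    candidates: Optional[Set[str]] = None
    for pictures, bits in rounds:
        features, bit = pictures[i], bits[j]
        if candidates is None:
            if bit == 0:
                continue                            # lines 3-4: wait
            candidates = set(features)              # line 6
        elif bit == 1:
            candidates &= features                  # line 10
        else:
            candidates -= features                  # line 12
        if len(candidates) <= 1:
            break                                   # line 14, or a dead end (empty set)
    return i, j, candidates


rounds = [([{"cat", "dog", "sun"}, {"car", "tree"}], [0, 1]),
          ([{"cat", "sun"}, {"dog"}], [1, 1]),
          ([{"sun", "moon"}, {"cat"}], [1, 0])]
print(get_enhanced_q(rounds, i=0, j=1))   # (0, 1, {'cat'})
```

In this toy input, answer position 1 truthfully reports whether picture 0 contains "cat", so the pair (0, 1) is a correct path. A wrong pair may stall, empty out, or even stop at a wrong singleton, which is why the adversary in the analysis abandons a path after about log_2 n steps.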



Now assume that the adversary does the following: whenever it has guessed an incorrect path, it executes the above algorithm for an expected number of log_2 n steps and then goes back to choose a different path (log_2 n is the expected time until the adversary realizes that it has chosen the wrong picture and answer pair). For each picture position 1 ≤ i ≤ L, there are L possible answer positions, so there are L^2 picture/answer pairs in total, each equally likely to be picked by the adversary. Of these, only l result in the algorithm halting with the secret feature. If the adversary finds a correct path, it stops by outputting the feature; this occurs with probability l/L^2. If the adversary chooses a wrong path, it goes back and chooses another, previously untried path. The number of ordered pairs of paths in which the first is wrong and the second is correct is l(L^2 − l), out of L^2(L^2 − 1). Thus the probability of the adversary stopping after two iterations is

$$\frac{L^2-l}{L^2}\cdot\frac{l}{L^2-1}$$

Continuing in this fashion, if we let Pr[y_i] denote the probability of the adversary stopping at the ith step, we get:

$$\Pr[y_i] = \frac{l}{L^2-(i-1)}\prod_{m=0}^{i-2}\frac{L^2-l-m}{L^2-m}, \quad \text{for } i \ge 2$$



Each of these paths results in the termination of the algorithm after a certain number of steps, once the correct path is finally chosen. We can now find the expected number of steps of the adversary. Let Y denote the number of steps taken by the adversary; stopping on the ith path (the outcome y_i) costs i log_2 n steps. We get:

$$\begin{aligned}
E[Y] &= \frac{l}{L^2}\log_2 n + \sum_{i=2}^{L^2-l+1} i\,\log_2 n\,\Pr[y_i] \\
&= \frac{l\,(L^2-l)!}{(L^2)!}\,\log_2 n \sum_{i=1}^{L^2-l+1} i\,\frac{(L^2-i)!}{(L^2-l-i+1)!} \\
&= \frac{L^2+1}{l+1}\,\log_2 n
\end{aligned}$$

In light of the previous result, and since extracting the features of each picture costs n/2 on average, the total amount of work done by the adversary is $\frac{n}{2}\cdot\frac{L^2+1}{l+1}\log_2 n$. The probability of success of the algorithm depends on whether the adversary has chosen a correct combination of picture and answer position. Thus if we again let p be the average probability of successfully extracting the feature, the result comes out the same as before: $p^{(\log_2 n)/2}$.
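The closed form for the expected number of paths can be checked mechanically. The sketch below sums i · Pr[y_i] with exact rational arithmetic, using the sampling-without-replacement model described above (each of the L^2 picture/answer pairs is tried at most once). It computes only the expected number of paths, so E[Y] is this value times log_2 n. The function name is ours.

```python
from fractions import Fraction


def expected_paths(L: int, l: int) -> Fraction:
    """E[i], where i is the index of the first correct picture/answer pair when the
    L*L pairs (l of them correct) are tried in uniformly random order."""
    N = L * L
    expected = Fraction(0)
    all_wrong_so_far = Fraction(1)                       # Pr[first i-1 choices all wrong]
    for i in range(1, N - l + 2):
        remaining = N - (i - 1)
        p_stop = all_wrong_so_far * Fraction(l, remaining)   # Pr[y_i]
        expected += i * p_stop
        all_wrong_so_far *= Fraction(remaining - l, remaining)
    return expected


print(expected_paths(20, 10), Fraction(20 * 20 + 1, 10 + 1))
```

Exact fractions avoid any floating-point doubt: the loop and (L^2 + 1)/(l + 1) agree exactly, and the probabilities Pr[y_i] sum to 1 because the last possible path is correct with certainty.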



7.3 The Practical Scheme 

We could analyze the workload in the practical scheme by examining the behavior of the linear congruential generator for small values of its arguments. However, for simplicity, we can assume that the position of the next picture is determined analogously to the previous protocol. Thus we can keep L unchanged and set l = 1 in our result for the previous protocol. What about the value of n? Of course, this should depend on the dictionary size N, which can be anywhere in the range 10^3 to 10^5. However, n represents the number of distinct features, and this cannot be more than N, because two words might contain the same letters, like "wolf" and "flow", and two words might represent the same concept, like synonyms. Therefore we can assume n = xN, where 0 < x < 1. Assuming x = 0.5, we get the following result for the adversary's work: $E = \frac{N}{4}\cdot\frac{L^2+1}{2}\log_2\left(\frac{N}{2}\right)$. The probability of success in this case also comes out to be $p^{(\log_2 n)/2}$.



7.4 Comparative Workloads of the Three Protocols 

Based on the results obtained in the previous subsections, we can compare the workloads of the adversary in the three protocols. First we fix n = 10^5 and N = 10^4 and plot the three workloads as a function of L. In the enhanced scheme we have assumed l = L/2. The three plots are shown in Figure 6.



(Plot: adversary's workload against L, from 0 to 200, for the Basic, Practical and Enhanced schemes, with a "Practical Zone Delimiter" marked.)

Fig. 6. Comparative workloads of the adversary in the three schemes



In the enhanced and practical schemes, the workload of the adversary increases with L, whereas in the basic protocol it does not depend on L. Notice the "Practical Zone Delimiter". It is placed at L = 25, since we believe that putting more pictures in a single iteration would place too great a computational burden on the human user. For small values of L, the Basic and the Enhanced schemes work better than the Practical Scheme. As we increase L even beyond the practical zone, the Practical Scheme becomes better, but this is because we have fixed l to be one half of L in the enhanced scheme. The advantage of the two schemes over the basic scheme does not come without a disadvantage: their memory and processing requirements also increase with L. Table 2 shows the comparison.

Figure 7 shows the comparative workloads as a function of n (N in the case of the practical scheme). The value of L is fixed at 20. The range of n is from 10^3 to 10^5 and that of N is from 10^2 to 10^4. Interestingly, the enhanced scheme becomes better with larger values of n, as should be evident from the fact that N is much smaller than n.
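The comparisons in Figures 6 and 7 follow directly from the three workload expressions of Sections 7.1 to 7.3. The sketch below evaluates them (as log_2 of the workload) with n = 10^5, N = 10^4, x = 0.5 and l = ⌊L/2⌋. The rounding of L/2 and the function names are our assumptions. It also searches for the first L at which the practical scheme overtakes the enhanced one; under these assumptions that point lies far beyond the practical zone limit of L = 25.

```python
import math


def basic_work(n: float) -> float:
    return n / 2 * math.log2(n)                                # (n/2) log2 n


def enhanced_work(n: float, L: int, l: int) -> float:
    return n / 2 * (L * L + 1) / (l + 1) * math.log2(n)        # (n/2)(L^2+1)/(l+1) log2 n


def practical_work(N: float, L: int, x: float = 0.5) -> float:
    return enhanced_work(x * N, L, 1)                          # n = xN and l = 1


n, N = 10 ** 5, 10 ** 4
for L in (5, 10, 20, 25):
    l = max(1, L // 2)
    print(L, [round(math.log2(w), 2) for w in
              (basic_work(n), enhanced_work(n, L, l), practical_work(N, L))])

crossover = next(L for L in range(2, 201)
                 if practical_work(N, L) > enhanced_work(n, L, max(1, L // 2)))
print("practical overtakes enhanced at L =", crossover)
```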
Fig. 7. Comparative workloads of the adversary in the three schemes with changing n



Table 2. Qualitative comparison of the three schemes

Scheme    | Image Evaluation | Memory                    | Computation
Basic     | ✓                | q                         | ✗
Enhanced  | ✓                | q and σ (L, l)            | ✗
Practical | ✓                | q and a, b, X_i (all < L) | 1 multiplication and addition modulo L
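Table 2 lists one multiplication and one addition modulo L as the only computation in the practical scheme. Assuming the linear congruential generator has the standard form X_{i+1} = (a·X_i + b) mod L (our assumption about its exact form; the function names are hypothetical), the position computation can be sketched as:

```python
from typing import List


def next_position(x: int, a: int, b: int, L: int) -> int:
    """One LCG step: X_{i+1} = (a * X_i + b) mod L."""
    return (a * x + b) % L


def positions(x0: int, a: int, b: int, L: int, count: int) -> List[int]:
    """The next `count` picture positions generated from the seed x0."""
    xs = []
    x = x0
    for _ in range(count):
        x = next_position(x, a, b, L)
        xs.append(x)
    return xs


print(positions(3, 5, 7, 16, 5))   # [6, 5, 0, 7, 10]
```

With L = 16, a = 5 and b = 7 the sequence has full period 16, as the Hull-Dobell conditions guarantee (b coprime to L, a − 1 divisible by every prime factor of L, and by 4 when 4 divides L). Parameters that violate these conditions revisit positions sooner, which is worth keeping in mind when choosing a and b.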



Finally, we show the interrelationship between l and L in the enhanced scheme. The workload of the adversary increases significantly with lower values of l and higher values of L, as shown in Figure 8.



8 Conclusion and Future Work 

Human identification protocols can be a good alternative to traditional, less secure password-based systems. Over the years, researchers have tried to construct efficient human identification protocols that are secure against passive or active adversaries. However, these protocols fall short in terms of efficiency and security. One such protocol was proposed by us in [1], and its security was based on the "conjectured difficulty" of obtaining the secret after observing some authentication sessions. In this paper, we have extended that work by giving some candidate alternative protocols, finding the exact hardness of these protocols in terms of the effort required by a human adversary, and giving a detailed analysis of the protocol proposed in [1]. A brief survey regarding the number of possible features in an image was also carried out. Our results show that a practical implementation of the protocol might be feasible, provided we are up against a resource-constrained human adversary.

Fig. 8. Graph of the adversary's workload in the enhanced scheme with changing l and fixed L

A notable future line of work is to devise a similar protocol that is secure against active adversaries. This might involve the human user sending challenges to the server, similar to [19]. However, it remains an open problem whether we can fine-tune the protocols to make them secure against active adversaries without increasing the workload on legitimate users. Another direction for future work is to come up with a different model for the distribution of features in images found on the web. This might give a more realistic quantification of the workload on the adversary.